Introduction to Unix & Linux: Basics, Commands, and File Formats
A comprehensive beginner's guide to mastering the command line, file systems, and essential Unix/Linux concepts for Windows users transitioning to WSL2.
Getting Started: Installing and Configuring WSL2
Installing WSL2
Windows Subsystem for Linux version 2 (WSL2) brings the power of a Linux environment directly to your Windows machine. Detailed instructions are available on the Microsoft documentation page.
  1. Open Windows PowerShell as administrator
  2. Run the command: wsl --install
  3. Restart your computer after installation completes
  4. Create a username and password for Ubuntu when prompted
You should now have access to a fully functional Ubuntu Linux terminal that behaves like a regular Ubuntu server, providing an authentic Unix experience right on your Windows desktop.
Configuring WSL2
Optimize your WSL2 environment by creating symbolic links to commonly used Windows folders. Your C:\ drive is accessible at /mnt/c/:
ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("MyDocuments")' | tr -d '\r')) ~/Documents
ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("Desktop")' | tr -d '\r')) ~/Desktop
ln -s $(wslpath $(powershell.exe '[environment]::getfolderpath("UserProfile")' | tr -d '\r'))/Downloads ~/Downloads
Set Ubuntu as Default Terminal
  • Search for and open the "Terminal" application
  • Click on the down arrow in the toolbar
  • Click on "Settings"
  • Under "Default Profile" select "Ubuntu"
The Origins: Unix and the Birth of Linux
1970s - Unix Created
Unix was created at Bell Labs by Ken Thompson and Dennis Ritchie. Designed as a multi-user, multitasking operating system with a powerful command line interface, Unix revolutionized computing by introducing concepts that remain fundamental today.
1991 - Linux Launched
Linux, a Unix-like operating system, was launched by Linus Torvalds as an open-source kernel. This free alternative to proprietary Unix systems democratized access to powerful computing tools and sparked a global collaboration movement.
Today - Everywhere
Today, Linux powers everything from smartphones to supercomputers, showcasing its remarkable adaptability and widespread adoption. From Android devices to the world's fastest supercomputers, Linux has become the backbone of modern computing infrastructure.
Why Unix/Linux? The Power of the Command Line
Efficiency & Automation
Text-based shell interface enables powerful automation, scripting, and remote management, dramatically boosting productivity and allowing repetitive tasks to be completed in seconds.
Speed & Precision
Commands are terse but powerful, designed for efficiency and speed in executing complex tasks. What takes dozens of clicks in a GUI can be accomplished with a single line of code.
Case Sensitivity
A critical distinction: 'File' is not the same as 'file' in this environment, requiring precise command input. This strictness ensures accuracy and prevents ambiguous operations.
Scripting Capabilities
Shells like Bash and Zsh provide robust scripting and command chaining capabilities for advanced workflows, enabling you to automate entire processes and analyses.
Advanced Command Line Usage
The Unix shell allows you to run complex operations with just a few commands, interact seamlessly with high-performance computing servers, and write reproducible analysis scripts. Whether you're processing gigabytes of genomic data or managing hundreds of files simultaneously, the command line provides unmatched power and flexibility.
Navigating the File System: Basic Commands
Linux distributions like Ubuntu and Kali already have a terminal available. On Ubuntu, press Ctrl + Alt + T to open it quickly.
Essential Navigation Commands
The basic syntax of a command is: command -options argument. For example, ls -l Documents would list the contents of the Documents directory in a detailed long format.
  • pwd — Print current directory path
  • ls — List files and directories; options: -l (detailed), -a (all files including hidden)
  • cd — Change directory; special shortcuts: .. (up one level), ~ (home directory)
  • Most commands support a --help option to display usage information, e.g., ls --help

These commands are your compass for navigating the structured world of Unix/Linux file systems, allowing you to move through directories and inspect their contents with ease.
cd /mnt/c/Users/hp/Downloads
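The navigation commands above can be tried safely in a throwaway directory. The paths below are made up purely for practice; nothing outside the scratch directory is touched:

```shell
# Work in a scratch area (the path is arbitrary, chosen for this demo)
mkdir -p /tmp/nav_demo/projects
cd /tmp/nav_demo/projects

pwd      # prints: /tmp/nav_demo/projects
ls -a    # lists everything, including the hidden . and .. entries

cd ..    # move up one level
pwd      # prints: /tmp/nav_demo
```

Running `pwd` after each `cd` is a good habit while learning: it confirms exactly where a relative move has taken you.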
Understanding Files and Folders in Ubuntu
The filesystem manages files and directories, organizing data hierarchically like a tree. Files hold information, while directories (or folders) contain files or other directories.
The image above illustrates a typical directory structure, showing user home directories like "larry," "imhotep," and "ubuntu" within the home directory. The home directory itself is located at the root of the filesystem, represented by a single / (slash). This root is the top-most directory from which all other directories branch.
Understanding Your Location
When using the shell, we navigate this hierarchy. Your current location is called the current working directory. To find out where you are, use the pwd command:
pwd
/home/ubuntu
The output, such as /home/ubuntu, indicates your home directory, which is your default location when opening a new terminal. Here, ubuntu is the username.
Understanding Paths
This way of specifying locations is called a path:
  • / at the start denotes the root directory
  • home is a folder within the root
  • Subsequent / characters act as separators between folders
  • ubuntu is the final folder in this specific path

Note that the / character has two meanings: it represents the root directory when at the beginning of a path, and it acts as a separator within a path.
Listing Files and Changing Directories
Listing Files
To view the contents of your current directory, use the ls (listing) command:
ls
Documents Downloads Music Public Desktop Movies Pictures Templates
This displays all visible files and folders in your current location, giving you an overview of what's available.
Changing Directory
The cd ("change directory") command changes your current working directory. You can specify a directory using an absolute path (starting from the root /), or a relative path (relative to your current directory).
To navigate to a specific path:
cd "/home/JNLab_Repo/hands-on/1a/Introduction to the Unix Shell/data-shell/"
Note the quotes: this path contains spaces, so it must be quoted (or each space escaped with a backslash).
We can check our current location with pwd. To move up one directory (to the parent directory), use ..:
cd ..
The shell interprets ~ (tilde) at the start of a path as your user's home directory (e.g., /home/ubuntu), making it a quick shortcut to return home.
Tab Completion: Your Time-Saving Tool
Tab completion is one of the most valuable productivity features in the Unix shell. It helps you avoid typing long file and directory names by automatically completing them for you.
  1. Type Partial Name: start typing part of a filename or directory name
  2. Press Tab: press the Tab ↹ key once
  3. Auto-Complete: if the name is unique, the shell completes it automatically
  4. See Options: if multiple options exist, press Tab ↹ twice to see all possibilities
Practical Example
For example, if you are in /home/ubuntu/Desktop/data-shell and type:
ls mol
then press Tab ↹, the shell automatically completes to:
ls molecules/

Tab completion not only saves time but also prevents typos and helps you discover what files and directories are available. Make it a habit to use Tab ↹ frequently—it will dramatically speed up your workflow!
Managing Files and Directories
  • mkdir <dir> — Create a new directory to organize your files
  • touch <file> — Create an empty file or update the timestamp of an existing file
  • cp <source> <dest> — Copy files or directories to a new location
  • mv <old> <new> — Move or rename files and directories
  • rm <file> — Remove files permanently (no recycle bin!)
Practical Example
mkdir my_project
cd my_project
touch report.txt
cp report.txt final_report.txt
mv final_report.txt /home/user/documents/
rm report.txt
Mastering these fundamental commands is crucial for effective file management, enabling you to organize, move, copy, and delete your data efficiently within the Unix/Linux environment. These operations form the backbone of day-to-day file system interactions.
Creating Directories: Step by Step
We now know how to explore files and directories, but how do we create them in the first place? Let's walk through the process of creating and organizing directories effectively.
First, we should see where we are and what we already have. Let's go back to our data-shell directory and use ls to see what it contains:
cd "/home/JNLab_Repo/hands-on/1a/Introduction to the Unix Shell/data-shell"
ls
README.txt coronavirus molecules sequencing things.txt
Now, let's create a new directory called thesis_notes using the mkdir ("make directory") command:
mkdir thesis_notes
The new directory is created in the current working directory. We can verify this with ls:
ls
README.txt coronavirus molecules sequencing thesis_notes things.txt

The mkdir command is your primary tool for creating organized directory structures. You can also create nested directories using the -p flag: mkdir -p parent/child/grandchild
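The difference between plain mkdir and mkdir -p is easy to see in a scratch directory (created with mktemp so nothing real is affected):

```shell
cd "$(mktemp -d)"   # work in a fresh scratch directory

# -p creates any missing parent directories in one command
mkdir -p parent/child/grandchild
ls parent/child     # prints: grandchild

# Without -p, mkdir refuses when the parent does not exist
mkdir missing/nested 2>/dev/null || echo "mkdir failed: 'missing' does not exist"
```

The -p flag is also safe to re-run: it does not complain if the directories already exist, which makes it handy in scripts.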
What's in a File Name? Understanding File Extensions
You may have noticed that all of the files in our data directory are named "something dot something". For example, README.txt, which indicates this is a plain text file.
The second part of such a name is called the filename extension, and it indicates what type of data the file holds. While Unix/Linux doesn't strictly require extensions to identify file types, they serve as helpful hints for both humans and programs.
  • .txt — Plain text file containing unformatted text
  • .csv — Text file with tabular data where columns are separated by commas
  • .tsv — Similar to CSV but values are separated by tabs
  • .log — Text file containing messages produced by software while it runs
  • .pdf — Portable Document Format for formatted documents
  • .png — Portable Network Graphics image file

Remember: In Unix/Linux, file extensions are conventions, not requirements. The system determines file type by examining the file's contents, not its name. However, using appropriate extensions makes your files more organized and easier to work with.
Moving and Renaming Files
In our data-shell directory we have a file called things.txt, which contains a note of books to read for our thesis. Let's move this file to the thesis_notes directory we created earlier, using the mv ("move") command:
Moving Files
mv things.txt thesis_notes/
The first argument tells mv what we're "moving", while the second is where it's to go. In this case, we're moving things.txt to thesis_notes/. We can check the file has moved there:
ls thesis_notes
things.txt
Renaming Files
This isn't a particularly informative name for our file, so let's change it! Interestingly, we also use the mv command to change a file's name. Here's how we would do it:
mv thesis_notes/things.txt thesis_notes/books.txt
In this case, we are "moving" the file to the same place but with a different name.

Important: Be careful when specifying the target file name, since mv will silently overwrite any existing file with the same name, which could lead to data loss.
The command mv also works with directories, and you can use it to move or rename an entire directory just as you use it to move an individual file.
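The overwrite behaviour warned about above can be demonstrated in a scratch directory (filenames here are made up for illustration):

```shell
cd "$(mktemp -d)"            # fresh scratch directory
echo "keep me" > notes.txt
touch draft.txt              # an empty file that will be overwritten

# mv silently replaces draft.txt; its old content is gone for good
mv notes.txt draft.txt
cat draft.txt                # prints: keep me

# Safer variants: mv -i prompts before overwriting, mv -n never overwrites
```

Getting into the habit of using mv -i when a target file might already exist is cheap insurance against accidental data loss.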
Removing Files and Directories
The Unix command used to remove or delete files is rm ("remove"). For example, let's remove one of the files we copied earlier:
rm backup/cubane.pdb
We can confirm the file is gone using ls backup/.
Removing Directories
What if we try to remove the whole backup directory we created in the previous exercise?
rm backup
rm: cannot remove `backup': Is a directory
We get an error. This happens because rm by default only works on files, not directories.
The rm command can remove a directory and all its contents if we use the recursive option -r, and it will do so without any confirmation prompts:
rm -r backup

Deleting Is Forever
The Unix shell doesn't have a trash bin that we can recover deleted files from (though most graphical interfaces to Unix do). Instead, when we delete files, they are unlinked from the file system so that their storage space on disk can be recycled. Tools for finding and recovering deleted files do exist, but there's no guarantee they'll work in any particular situation, since the computer may recycle the file's disk space right away.
Given that there is no way to retrieve files deleted using the shell, rm -r should be used with great caution (you might consider adding the interactive option rm -r -i).
To remove empty directories, we can also use the rmdir command. This is a safer option than rm -r, because it will never delete the directory if it contains files, giving us a chance to check whether we really want to delete all its contents.
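The contrast between rmdir and rm -r can be seen safely in a scratch directory (the backup directory below is recreated just for this demo):

```shell
cd "$(mktemp -d)"                # fresh scratch directory
mkdir backup && touch backup/cubane.pdb

# rmdir refuses to delete a non-empty directory
rmdir backup 2>/dev/null || echo "rmdir refused: directory not empty"

rm backup/cubane.pdb             # remove the file first...
rmdir backup                     # ...then rmdir succeeds on the empty directory
ls                               # backup is gone
```

This "fail first, inspect, then delete" workflow is exactly why rmdir is the safer tool when you believe a directory should already be empty.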
Wildcards: Working with Multiple Files
Wildcards are special characters that can be used to access multiple files at once, dramatically increasing your efficiency when working with many files. The most commonly-used wildcard is *, which is used to match zero or more characters.
  • *.pdb — Matches every file that ends with the '.pdb' extension
  • p*.pdb — Only matches pentane.pdb and propane.pdb, because the 'p' at the front only matches filenames that begin with the letter 'p'
The Question Mark Wildcard
Another common wildcard is ?, which matches any character exactly once. For example:
  • ?ethane.pdb would only match methane.pdb (whereas *ethane.pdb matches both ethane.pdb and methane.pdb)
  • ???ane.pdb matches three characters followed by ane.pdb, giving cubane.pdb, ethane.pdb, and octane.pdb

When the shell sees a wildcard, it expands the wildcard to create a list of matching filenames before running the command that was asked for. As an exception, if a wildcard expression does not match any file, Bash will pass the expression as an argument to the command as it is. For example, typing ls *.pdf in the molecules directory (which does not contain any PDF files) results in an error message that there is no file called *.pdf.
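You can watch this expansion happen by putting echo in front of a pattern: the shell expands the wildcard first, and echo simply prints whatever list it receives. The filenames below are created just for this demo:

```shell
cd "$(mktemp -d)"                  # fresh scratch directory
touch cubane.pdb ethane.pdb methane.pdb notes.txt

# echo makes the expansion visible: the shell, not echo, builds this list
echo *.pdb         # prints: cubane.pdb ethane.pdb methane.pdb
echo ?ethane.pdb   # prints: methane.pdb

# No match: Bash passes the pattern through unchanged
echo *.pdf         # prints: *.pdf
```

Trying `echo pattern` before running a destructive command like `rm pattern` is a good way to check exactly which files will be affected.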
Navigation Exercise
Starting from /home/amanda/data, which of the following commands could Amanda use to navigate to her home directory (/home/amanda)?
  1. cd .
  2. cd /
  3. cd /home/amanda
  4. cd ../..
  5. cd ~
  6. cd home
  7. cd ~/data/..
  8. cd
  9. cd ..
Correct Options:
  • 3. Yes: This is an example of using the full absolute path
  • 5. Yes: ~ stands for the user's home directory, in this case /home/amanda
  • 7. Yes: Unnecessarily complicated, but correct
  • 8. Yes: Shortcut to go back to the user's home directory
  • 9. Yes: Goes up one level
Viewing and Editing File Contents
Viewing Commands
  • cat <file> — Display entire file content
  • head <file> / tail <file> — Show first/last 10 lines by default
  • more <file> / less <file> — Paginate file content for easier reading
  • grep <pattern> <file> — Search for text patterns inside files
Editing Tools
For editing, common tools include nano (simple and beginner-friendly) and the more powerful, advanced options like vim and emacs, which are staples for experienced users.
Looking Inside Files
For example, let's take a look at the cubane.pdb file in the molecules directory.
We will start by printing the whole content of the file with the cat command, which stands for "concatenate" (we will see why it's called this way in a little while):
cd molecules
cat cubane.pdb
Sometimes it is useful to look at only the top few lines of a file (especially for very large files). We can do this with the head command:
head cubane.pdb
Customizing File Views with Options
By default, head prints the first 10 lines of the file. We can change this using the -n option, followed by a number, for example:
head -n 2 cubane.pdb
COMPND      CUBANE
AUTHOR      DAVE WOODCOCK  95 12 06
This displays only the first 2 lines, giving you a quick peek at the file's beginning without overwhelming your screen.
Similarly, we can look at the bottom few lines of a file with the tail command:
tail -n 2 cubane.pdb
TER      17        1
END
This is particularly useful when monitoring log files or checking the most recent entries in a dataset.
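head and tail also combine well through a pipe to pull out a slice from the middle of a file. A sketch with a small made-up file:

```shell
cd "$(mktemp -d)"                          # fresh scratch directory
printf 'line1\nline2\nline3\nline4\nline5\n' > sample.txt

# First keep lines 1-4, then take the last 2 of those: lines 3 and 4
head -n 4 sample.txt | tail -n 2
# prints:
# line3
# line4
```

The general recipe for "lines M through N" is `head -n N file | tail -n $((N - M + 1))`.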
Interactive File Browsing
Finally, if we want to open the file and browse through it interactively, we can use the less command:
less cubane.pdb
less will open the file in a viewer where you can use ↑ and ↓ to move line-by-line or the Page Up and Page Down keys to move page-by-page. You can exit less by pressing Q (for "quit"). This will bring you back to the console.

The name "less" comes from the joke "less is more": it is an improvement over the older more command, adding backward navigation through files.
Counting Words, Lines, and Characters
The wc (word count) command is a powerful tool for analyzing text files. It can count lines, words, and characters in one or more files.
wc *.pdb
  20  156 1158 cubane.pdb
  12   84  622 ethane.pdb
   9   57  422 methane.pdb
  30  246 1828 octane.pdb
  21  165 1226 pentane.pdb
  15  111  825 propane.pdb
 107  819 6081 total
In this case, we used the * wildcard to count lines, words, and characters (in that order, left-to-right) of all our PDB files. The output shows three columns: lines, words, and characters for each file, with a total at the bottom.
Customizing Word Count Output
Often, we only want to count one of these things, and wc has options for all of them:
  • -l — Counts lines only
  • -w — Counts words only
  • -c — Counts characters only
For example, the following counts only the number of lines in each file:
wc -l *.pdb
 20 cubane.pdb
 12 ethane.pdb
  9 methane.pdb
 30 octane.pdb
 21 pentane.pdb
 15 propane.pdb
107 total
This focused output makes it easy to quickly assess file sizes or compare the number of records across multiple files. The -l option is particularly useful when working with data files where each line represents a record or observation.
Combining and Redirecting Output
Combining Files
Earlier, we said that the cat command stands for "concatenate". This is because this command can be used to concatenate (combine) several files together. For example, if we wanted to combine all PDB files into one:
cat *.pdb
This displays the contents of all PDB files one after another in the terminal.
Redirecting Output
The commands we've been using so far print their output to the terminal. But what if we wanted to save it into a file? We can achieve this by redirecting the output of the command to a file using the > operator.
wc -l *.pdb > number_lines.txt
Now, the output is not printed to the console, but instead sent to a new file. We can check that the file was created with ls.

The > operator will create a new file or overwrite an existing file. If you want to append to an existing file instead, use >>.
File Permissions and Ownership
Change Permissions
chmod: Modify read, write, and execute permissions for files and directories, controlling who can access and modify your data
Change Ownership
chown: Assign new owners and groups to files, controlling access at the user and group level
Permission Format
rwxr-xr--: the first three characters are the owner's permissions (read, write, execute), the next three the group's (read and execute), and the last three those of all other users (read only)
Security & Multi-user
Permissions are vital for security and effective collaboration in multi-user environments, preventing unauthorized access and accidental data modification
Example: Making a Script Executable
The command chmod 755 script.sh sets the script to be executable by all users, with the owner having full read, write, and execute permissions (7), while the group and others have read and execute permissions only (5).
Introduction to Shell Scripts
So far, we have been running commands directly on the console in an interactive way. However, to re-run a series of commands (or an analysis), we can save the commands in a file and execute all those operations again later by typing a single command. The file containing the commands is usually called a shell script (you can think of them as small programs).
For example, let's create a shell script that counts the number of atoms in one of our molecule files (in the molecules directory). We could achieve this with the following command:
cat cubane.pdb | grep "ATOM" | wc -l
This command chains together three operations:
  1. cat cubane.pdb reads the file contents
  2. grep "ATOM" filters for lines containing "ATOM"
  3. wc -l counts the number of matching lines
Instead of typing this every time, we can save it as a reusable script that we can run whenever needed, making our workflow more efficient and reproducible.
Creating Your First Shell Script
Opening Nano Editor
We can create a file with Nano in the following way:
nano count_atoms.sh
This command opens the Nano text editor directly in your terminal, creating a new file named count_atoms.sh if it doesn't already exist, or opening it for editing if it does.
Navigating in Nano
Nano is a simple, user-friendly text editor. You navigate within the editor using your keyboard's arrow keys (← → ↑ ↓) as the mouse does not work inside Nano.
Saving and Exiting
Once you are done typing or editing:
  • Save your changes by pressing Ctrl+O (the caret ^ symbol often indicates the Ctrl key)
  • Nano will prompt you to confirm the filename—press Enter to save
  • Exit Nano by pressing Ctrl+X
  • If you haven't saved your changes, Nano will ask if you want to save before exiting
Script Content
For now, type this code into your script:
#!/bin/bash
cat cubane.pdb | grep "ATOM" | wc -l
The first line, #!/bin/bash, is known as the shebang. It tells the operating system which interpreter to use for executing the script. In this case, it specifies that the script should be run using bash, the Bourne-Again SHell.
Making Scripts Executable and Running Them
  1. Save and Exit Nano: after typing your script content, press Ctrl+O to save and Ctrl+X to exit Nano
  2. Add Execute Permissions: by default, new files do not have execute permissions. Use the chmod command to add them:
     chmod +x count_atoms.sh
  3. Run Your Script: finally, you can run your script by typing:
     ./count_atoms.sh
     The ./ prefix tells the shell to look for the script in the current directory
  4. View the Output: when you run it, it will display the count of "ATOM" lines from cubane.pdb directly to your terminal

The +x option in chmod +x adds the execute permission for the owner, group, and others. This is a quick way to make a script runnable without specifying detailed permission codes.
The Power of For Loops
Loops are a programming construct which allow us to repeat a command or set of commands for each item in a list. As such, they are key to productivity improvements through automation. Similar to wildcards and tab completion, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).
Going back to our molecules directory, suppose we wanted to count the number of atoms in each of our molecules' PDB files. As a reminder, here is the command to do this for one of our files:
cat cubane.pdb | grep "ATOM" | wc -l
Of course, we could manually then repeat this for each of our molecule files: cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb, propane.pdb. But that would be tedious and error-prone. Instead, we can use a loop to automate this process!
For Loop Syntax and Structure
Basic Syntax
for thing in list_of_things
do
    # Indentation within the loop is not required, but aids legibility
    operation_using ${thing}
done
The loop structure consists of:
  • for — Starts the loop
  • thing — Variable name (you choose this)
  • in list_of_things — Items to iterate over
  • do — Begins the command block
  • ${thing} — Accesses the current item
  • done — Ends the loop
Practical Example
Taking our command above to count atoms, let's create a new script called count_loop.sh, where we apply this idea:
#!/bin/bash
for filename in cubane.pdb ethane.pdb methane.pdb
do
    # count the number of lines containing the word "ATOM"
    cat ${filename} | grep "ATOM" | wc -l
done
In this script, the loop will execute three times—once for each filename. Each time it runs (called an iteration), the variable filename takes on the next value in the list, and the command inside the loop is executed with that value.
Understanding Loop Execution
If we ran this script (bash count_loop.sh), we would get the following output:
16
8
5
Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable we specify (in this case filename). Then, the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable's value using $filename or ${filename}.
  1. First iteration: filename = cubane.pdb; counts atoms in cubane.pdb; output: 16
  2. Second iteration: filename = ethane.pdb; counts atoms in ethane.pdb; output: 8
  3. Third iteration: filename = methane.pdb; counts atoms in methane.pdb; output: 5
In our example, at each iteration of the for loop, the variable $filename stored a different value, cycling through cubane.pdb, ethane.pdb, and finally methane.pdb.

You can use wildcards in loops too! Instead of listing all files explicitly, you could write for filename in *.pdb to process all PDB files in the directory automatically.
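A minimal sketch of a wildcard-driven loop, using a pair of made-up PDB files created in a scratch directory; quoting "${filename}" keeps the loop working even if a filename contains spaces:

```shell
cd "$(mktemp -d)"                          # fresh scratch directory
printf 'ATOM 1\nATOM 2\nEND\n' > mol_a.pdb # two tiny demo files
printf 'ATOM 1\nEND\n'         > mol_b.pdb

# Loop over every .pdb file; grep -c counts lines matching "ATOM"
for filename in *.pdb
do
    echo "${filename}: $(grep -c "ATOM" "${filename}") atom lines"
done
# prints:
# mol_a.pdb: 2 atom lines
# mol_b.pdb: 1 atom lines
```

Printing the filename alongside each count, as here, makes the output far easier to interpret than a bare column of numbers.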
Sequencing Commands and Shell Scripting
Command Chaining
Unix provides multiple ways to sequence commands:
  • ; — Sequential execution (run regardless of success)
  • && — Conditional success (run next only if previous succeeds)
  • || — Conditional failure (run next only if previous fails)
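The three sequencing operators behave differently when a command fails, which is easiest to see by running mkdir twice (the second attempt fails because the directory already exists):

```shell
cd "$(mktemp -d)"                              # fresh scratch directory

mkdir demo ; echo "runs regardless"            # ; runs the next command either way
mkdir demo 2>/dev/null && echo "only on success"   # skipped: second mkdir fails
mkdir demo 2>/dev/null || echo "only on failure"   # printed: second mkdir fails
```

In scripts, `command && next` is a compact way to say "only continue if the previous step worked", and `command || exit 1` is a common way to bail out on error.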
Pipes
Use | to pass the output of one command as input to another. For example, ls -l | grep "\.txt" lists files and filters for names containing ".txt" (the backslash makes the dot literal, since grep patterns treat a bare . as "any character").
Variables
Use $1, $@ for script inputs, allowing your scripts to accept arguments and become more flexible
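A small sketch of script arguments, using a made-up script name (count_args.sh) written on the fly with a heredoc: $1 is the first argument, $# the argument count, and $@ all arguments:

```shell
cd "$(mktemp -d)"                  # fresh scratch directory

# Write a tiny demonstration script (the name is arbitrary)
cat > count_args.sh <<'EOF'
#!/bin/bash
echo "first argument: $1"
echo "number of arguments: $#"
echo "all arguments: $@"
EOF
chmod +x count_args.sh

./count_args.sh cubane.pdb ethane.pdb
# prints:
# first argument: cubane.pdb
# number of arguments: 2
# all arguments: cubane.pdb ethane.pdb
```

Accepting filenames as arguments like this is what turns a one-off command into a reusable tool.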
Wildcards
Use * to match multiple characters in file names (e.g., rm *.log to delete all log files)
Shell Scripting
Save commands in .sh files and run them with bash script.sh for reproducible workflows
Understanding File Formats in Unix/Linux
The file Command
The file command inspects files using magic numbers and content analysis to identify their type, regardless of the filename or extension.
No Strict Extensions
Unlike Windows, Unix/Linux doesn't rely on file extensions to determine file type. The operating system examines the file's actual content and structure. File extensions are optional conventions for human readability.
Example: file Command Output
$ file mydocument.txt
mydocument.txt: ASCII text
$ file myprogram
myprogram: ELF 64-bit LSB executable, x86-64
$ file archive.tar.gz
archive.tar.gz: gzip compressed data, from Unix
Common Types
Includes ASCII text, executable binaries, and compressed files (.gz, .zip, .tar)
Compression
Use gzip / gunzip for .gz files; tar for archiving multiple files together
The FASTA Format for Sequences
The FASTA format was invented in 1988 and designed to represent nucleotide or peptide sequences. It originates from the FASTA software package, but is now a standard in the world of bioinformatics.
Format Structure
The first line in a FASTA file starts with a > (greater-than) symbol followed by the description or identifier of the sequence. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.
Sample Sequences
>KX580312.1 Homo sapiens truncated breast cancer 1 (BRCA1) gene, exon 15 and partial cds
GTCATCCCCTTCTAAATGCCCATCATTAGATGATAGGTGGTACATGCACAGTTGCTCTGGGAGTCTTCAGAATAGAAACTACCCATCTCAAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGTAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAG
>KRN06561.1 heat shock [Lactobacillus sucicola DSM 21376 = JCM 15457]
MSLVMANELTNRFNNWMKQDDFFGNLGRSFFDLDNSVNRALKTDVKETDKAYEVRIDVPGIDKKDITVDYHDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKINGNHIEIE
FASTA files are simple, human-readable, and widely supported by bioinformatics tools, making them the go-to format for sharing and analyzing biological sequence data.
Multiple Sequences in FASTA Files
A FASTA file can contain multiple sequences. Each sequence will be separated by their "header" line, starting with >. This makes FASTA files ideal for storing entire datasets, such as all proteins in an organism or all genes in a genome.
Example with Multiple Sequences
>KRN06561.1 heat shock [Lactobacillus sucicola DSM 21376 = JCM 15457]
MSLVMANELTNRFNNWMKQDDFFGNLGRSFFDLDNSVNRALKTDVKETDKAYEVRIDVPGIDKKDITVDYHDGVLSVNAKRDSFNDESDSEGNVIASERSYGRFARQYSLPNVDESGIKAKCEDGVLKLTLPKLAEEKINGNHIEIE
>3HHU_A Chain A, Human Heat-Shock Protein 90 (Hsp90)
MPEETQTQDQPMEEEEVETFAFQAEIAQLMSLIINTFYSNKEIFLRELISNSSDALDKIRYESLTDPSKLDSGKELHINLIPNKQDRTLTIVDTGIGMTKADLINNLGTIAKSGTKAFMEALQAGADISMIGQFGVGFYSAYLVAEKVTVITKHNDDEQYAWESSAGGSFTVRTDTGEPMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEK

Each sequence entry begins with a header line starting with >, followed by the sequence data on subsequent lines. The header typically contains an identifier and description. Sequences can span multiple lines for readability.
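Because every record starts with a ">" header line, counting sequences in a FASTA file is a one-liner with grep. The demo file below is made up for illustration:

```shell
cd "$(mktemp -d)"                  # fresh scratch directory
# A tiny made-up FASTA file with two records
printf '>seq1 demo record\nACGT\n>seq2 demo record\nGGCC\n' > demo.fasta

grep -c "^>" demo.fasta    # prints: 2  (count of header lines = sequences)
grep "^>" demo.fasta       # prints just the two header lines
```

Anchoring the pattern with ^ matters: it matches ">" only at the start of a line, so a stray ">" inside a description cannot inflate the count.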
The FASTQ Format: Sequences with Quality
The FASTQ format is also a text-based format to represent nucleotide sequences, but it also contains the corresponding quality score of each nucleotide. It is the standard for storing the output of high-throughput sequencing instruments such as Illumina machines.
FASTQ Structure
A FASTQ file uses four lines per sequence:
  1. Header line: begins with a @ character and is followed by a sequence identifier and an optional description (like a FASTA title line)
  2. Sequence line: the raw sequence letters (nucleotides)
  3. Separator line: begins with a + character and is optionally followed by the same sequence identifier again
  4. Quality line: encodes the quality values for the sequence in line 2, and must contain the same number of symbols as letters in the sequence
Example FASTQ Entry
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*(((**+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
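Since each record occupies exactly four lines, the number of reads in a FASTQ file is simply the line count divided by four. A sketch with a made-up two-read file:

```shell
cd "$(mktemp -d)"                  # fresh scratch directory
# A made-up FASTQ file with two reads (4 lines each)
printf '@r1\nACGT\n+\n!!!!\n@r2\nGGCC\n+\n####\n' > reads.fastq

# reads = total lines / 4
echo $(( $(wc -l < reads.fastq) / 4 ))    # prints: 2
```

Counting "@" lines instead would be unreliable here: the "@" character can also appear in the quality line, so the four-lines-per-record arithmetic is the safe approach.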
Understanding PHRED Quality Scores
The quality score, also called the PHRED score, represents the probability that the corresponding base call is incorrect. Understanding these scores is crucial for assessing the reliability of sequencing data.
Key Characteristics
  • PHRED scores use a logarithmic scale
  • Represented by ASCII characters
  • Typically range from 0 to 40
  • Each character maps to a specific quality value
  • Q10 — 90% accuracy (1 in 10 bases incorrect)
  • Q20 — 99% accuracy (1 in 100 bases incorrect)
  • Q30 — 99.9% accuracy (1 in 1,000 bases incorrect)
  • Q40 — 99.99% accuracy (1 in 10,000 bases incorrect)
The higher the PHRED score, the lower the probability of an incorrect base call, and therefore the higher the accuracy of the sequence. Quality scores are essential for filtering low-quality reads and ensuring the reliability of downstream analyses.
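The table above follows directly from the PHRED definition Q = -10 * log10(P), which rearranges to an error probability of P = 10^(-Q/10). A quick awk one-liner reproduces the values:

```shell
# Error probability for each PHRED score: P = 10^(-Q/10)
awk 'BEGIN { for (q = 10; q <= 40; q += 10) printf "Q%d  error probability = %g\n", q, 10^(-q/10) }'
# prints:
# Q10  error probability = 0.1
# Q20  error probability = 0.01
# Q30  error probability = 0.001
# Q40  error probability = 0.0001
```

The logarithmic scale is why a seemingly small jump from Q20 to Q30 means a tenfold reduction in error rate.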
SAM/BAM Alignment File Format
SAM (Sequence Alignment/Map) is a text-based format for storing biological sequences aligned to a reference sequence, developed by Heng Li. It's widely used for storing data generated by Next Generation Sequencing technologies, usually mapped to a reference genome.
Format Components
The SAM format consists of two main sections:
  • Header section: Contains metadata about the reference sequences and alignment parameters
  • Alignment section: Contains the actual read alignments in tab-delimited format
BAM Format
The binary representation of a SAM file is a BAM file, which is a compressed SAM file. BAM files are more efficient for storage and processing, but are not human-readable. SAM files can be analyzed and edited with the software SAMtools.
In brief, it consists of a header section and reads (with other information) in tab-delimited format. Each alignment record contains 11 mandatory fields providing information about the read, its position, quality, and alignment characteristics.
SAM File Field Structure
Each alignment line contains 11 mandatory tab-separated fields that provide comprehensive information about how a sequence read aligns to a reference genome.
QNAME
Query template name (read identifier)
FLAG
Bitwise flag describing alignment properties
RNAME
Reference sequence name
POS
1-based leftmost mapping position
MAPQ
Mapping quality score
CIGAR
Alignment description string
RNEXT
Reference name of the mate/next read
PNEXT
Position of the mate/next read
TLEN
Observed template length
SEQ
Segment sequence
QUAL
Base quality string (PHRED+33 encoded)

SAM/BAM files are the standard output format for most alignment tools like BWA, Bowtie2, and HISAT2. Understanding their structure is essential for working with sequencing data and performing downstream analyses like variant calling.
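Because the alignment section is plain tab-delimited text, individual fields can be pulled out with standard Unix tools. A sketch using a minimal, entirely made-up alignment record:

```shell
# A made-up SAM record with the 11 mandatory tab-separated fields:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
rec=$(printf 'read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tGATTACAT\tIIIIIIII')
pos=$(printf '%s\n' "$rec" | cut -f4)     # field 4: 1-based mapping position
cigar=$(printf '%s\n' "$rec" | cut -f6)   # field 6: CIGAR string
echo "POS=$pos CIGAR=$cigar"              # POS=100 CIGAR=8M
```

Real BAM files are binary, so the same extraction would first require `samtools view` to decompress them into SAM text.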
VCF and GFF File Formats
The VCF Format
VCF (Variant Call Format) is a text-based file format used to store gene sequence variations (SNVs, indels).
The format was developed for genotyping projects and is the standard for representing variation in the genome of a species. A header region of lines beginning with ## holds metadata, followed by variant records in tab-delimited format.
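Because the metadata lines share the fixed ## prefix, grep can separate them from the variant records. A sketch using a tiny, made-up VCF fragment:

```shell
# Build a tiny, made-up VCF fragment, then strip the ## metadata lines.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t12345\t.\tA\tG\n' > mini.vcf
grep -v '^##' mini.vcf   # keeps the column header line and the variant records
```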
Generic Feature Formats (GFF)
A GFF (general feature format; file extension .gff2 or .gff3) describes the various sequence elements that make up a gene and is a standard way of annotating genomes.
It defines the features present within a gene in the body of the GFF file, including transcripts, regulatory regions, untranslated regions, exons, introns, and coding sequences.
Example GFF File
##description: evidence-based annotation of the human genome (GRCh38), version 25 (Ensembl 85)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2016-07-15
chr1  HAVANA  gene        11869  14409  .  +  .  gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"
chr1  HAVANA  transcript  11869  14409  .  +  .  gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"
A GFF file should contain 9 columns, each describing different attributes of genomic features such as sequence name, source, feature type, start and end positions, score, strand, phase, and attributes.
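Since column 3 of a GFF file holds the feature type, awk can filter for one kind of feature. A sketch using a two-record, made-up fragment:

```shell
# A made-up two-record GFF fragment; column 3 is the feature type.
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n' >  mini.gff
printf 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "G1";\n' >> mini.gff
awk -F'\t' '$3 == "gene"' mini.gff   # prints only the gene record
```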
Additional Bioinformatics File Formats
GTF - Gene Transfer Format
The GTF (gene transfer format) file type shares the same format as GFF files, though it is used to define gene and transcript-related features exclusively. The attribute field includes gene_id or transcript_id values.
BED - Browser Extensible Data
The BED file format includes information about sequences that can be visualized in a genome browser. BED files are tab-delimited and include up to 12 fields of data with consistent columns.
Tar.gz - Compressed Archive
The Tar.gz format (also called a "Tarball") is a compressed file type that can store bioinformatics software or raw data efficiently.
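Creating and unpacking a tarball uses the tar command; a minimal round trip with throwaway files (names invented for the demo) looks like this:

```shell
# Bundle a directory into a gzip-compressed tarball, then unpack a copy.
mkdir -p project && echo "hello" > project/notes.txt
tar -czf project.tar.gz project            # c = create, z = gzip, f = archive name
mkdir -p restored
tar -xzf project.tar.gz -C restored        # x = extract into restored/
cat restored/project/notes.txt             # hello
```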
PDB - Protein Data Bank
PDB files contain atomic coordinates and are used by the Protein Data Bank to store 3D protein structures. View them with PyMOL.
PED - Pedigree Format
PED (.ped file extension) is a file format for pedigree analysis, which creates a familial relationship between different samples. Used with PLINK.
CSV - Comma Separated Values
CSV files store tabular data: each line is a row and columns are delimited with commas. They can hold many kinds of sequencing metadata and open directly in Excel.
JSON - JavaScript Object Notation
JSON is a common file format used in a growing number of bioinformatics applications for structured data exchange.
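Returning to the CSV entry above: because the columns are comma-delimited, cut can extract them directly. A sketch with made-up sample data:

```shell
# A tiny, made-up CSV file; cut -d',' selects comma-delimited columns.
printf 'sample,variant,count\nUK01,Alpha,12\nUK02,Delta,7\n' > samples.csv
cut -d',' -f2 samples.csv   # prints the variant column: variant, Alpha, Delta
```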
Summary and Next Steps
Powerful Environment
Unix/Linux offers a robust and flexible environment through its command line interface, providing unmatched control over your computing resources
Efficiency Unlocked
Mastering basic commands streamlines file and system management tasks, transforming hours of manual work into seconds of automated processing
Automation & Beyond
Understanding file formats and scripting opens doors to automation and advanced usage, enabling reproducible research and scalable data analysis
Continue Your Journey
Ready to dive deeper? Explore advanced shell scripting, system administration, and diverse Linux distributions. The command line skills you've learned today form the foundation for powerful computational work in bioinformatics, data science, and beyond.
Start Practicing

Open your terminal and try these commands today! The best way to learn is by doing. Practice these commands regularly, experiment with different options, and don't be afraid to make mistakes—that's how you learn!
Thank You
We hope this guide has empowered you to begin your journey with Unix/Linux and the command line. Remember, every expert was once a beginner—keep practicing, stay curious, and don't hesitate to explore!
Quiz: Practice Exercise 1
Question 1:
For this exercise, make sure you are in the course materials directory:
cd ~/Desktop/data-shell
Make a copy of the sequencing directory named backup.

Hint: Think about which command you use to copy files and directories. Remember that directories require a special option!
Quiz Answer: Exercise 1
Answer:
When copying an entire directory, you will need to use the option -r with the cp command (-r means "recursive").
What Doesn't Work
If we run the command without the -r option, this is what happens:
cp sequencing backup
cp: -r not specified; omitting directory 'sequencing'
This message is already indicating what the problem is. By default, directories (and their contents) are not copied unless we specify the option -r.
The Correct Solution
This would work:
cp -r sequencing backup
Running ls we can see a new folder called backup:
ls
README.txt  backup  books_copy.txt  coronavirus  molecules  sequencing  thesis_notes

The -r (recursive) flag is essential when working with directories. It tells cp to copy the directory and all of its contents, including subdirectories.
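The recursive copy can be rehearsed anywhere with a throwaway directory tree (the names below are made up for illustration):

```shell
# Recreate the exercise with a disposable directory tree.
mkdir -p demo/sequencing/run1
touch demo/sequencing/run1/sampleA_1.fq.gz
cd demo
cp -r sequencing backup        # -r copies the directory and everything inside it
ls backup/run1                 # sampleA_1.fq.gz
```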
Quiz: Practice Exercise 2
Question 2: Oral Discussion
For this exercise, make sure you are in the course materials directory:
cd ~/Desktop/data-shell
Part A
What does cp do when given several filenames and a directory name?
mkdir -p backup
cp molecules/cubane.pdb molecules/ethane.pdb backup
Part B
In the example below, what does cp do when given three or more file names?
cp molecules/cubane.pdb molecules/ethane.pdb molecules/methane.pdb

Take a moment to think about these questions before moving to the next card with the answer. Try running these commands in your terminal to observe the behavior!
Quiz Answer: Exercise 2
Answer:
1
Part A: Multiple Files to Directory
If given more than one file name followed by a directory name (i.e., the destination directory must be the last argument), cp copies the files to the named directory. This is the standard way to copy multiple files at once.
2
Part B: Error Case
If given three or more file names, cp throws an error such as the one below, because it expects a directory name as the last argument:
cp: target 'molecules/methane.pdb' is not a directory
The command fails because cp interprets the last argument as the destination, and in this case, it's a file, not a directory.

Key Takeaway: When using cp with multiple source files, the last argument must be a directory where all the files will be copied.
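Both behaviours are easy to reproduce with empty placeholder files (names invented for the demo):

```shell
# Several source files + a directory as the last argument: all are copied there.
mkdir -p molecules backup
touch molecules/cubane.pdb molecules/ethane.pdb
cp molecules/cubane.pdb molecules/ethane.pdb backup
ls backup                      # cubane.pdb  ethane.pdb
```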
Quiz: Wildcard Challenge
Question 3:
Change into the molecules directory. Which ls command(s) will produce this output?
ethane.pdb methane.pdb
1
ls *t*ane.pdb
2
ls *t?ne.*
3
ls *t??ne.pdb
4
ls ethane.*

Remember: The * wildcard matches zero or more characters, while ? matches exactly one character.
Quiz Answer: Wildcard Challenge
Answer:
1
No
This shows all files whose names contain zero or more characters (*), then the letter t, then zero or more characters (*), then ane.pdb. It matches ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, so it lists more files than we want.
2
No
This shows all files whose names start with zero or more characters (*) followed by the letter t, then a single character (?), then ne. followed by zero or more characters (*). This will give us octane.pdb and pentane.pdb but doesn't match anything which ends in thane.pdb.
3
Yes ✓
This fixes the problems of option 2 by matching two characters (??) between t and ne. This correctly matches ethane.pdb and methane.pdb.
4
No
This only shows files starting with ethane., which would only match ethane.pdb, missing methane.pdb.
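You can verify each option by recreating the molecules files as empty placeholders and letting the shell expand the globs:

```shell
# Empty stand-ins for the molecules directory contents.
mkdir -p molecules && cd molecules
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
ls *t??ne.pdb    # ethane.pdb  methane.pdb
ls *t?ne.*       # octane.pdb  pentane.pdb
```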
Quiz: Output Redirection Exercise
Question 4:
Move to the directory sequencing and complete the following tasks:
01
List and Save
List the files in the run1/ directory. Save the output in a file called sequencing_files.txt.
02
Observe Replacement
What happens to the content of that file after you run the command ls run2 > sequencing_files.txt?
03
Append Instead
The operator >> can be used to append the output of a command to an existing file. Re-run both of the previous commands, but instead use the >> operator the second time. What happens now?

Hint: Remember the difference between > (overwrite) and >> (append)!
Quiz Answer: Task 1
Answer - Task 1:
To list the files in the directory we use ls, followed by > to save the output in a file:
ls run1 > sequencing_files.txt
We can check the content of the file:
cat sequencing_files.txt
sampleA_1.fq.gz
sampleA_2.fq.gz
sampleB_1.fq.gz
sampleB_2.fq.gz
sampleC_1.fq.gz
sampleC_2.fq.gz
sampleD_1.fq.gz
sampleD_2.fq.gz
The output shows all the FASTQ files from the run1 directory, neatly saved in our text file. This demonstrates how redirection with > creates a new file and writes the command output to it.
Quiz Answer: Task 2
Answer - Task 2:
If we run ls run2/ > sequencing_files.txt, we will replace the content of the file:
cat sequencing_files.txt
sampleE_1.fq.gz
sampleE_2.fq.gz
sampleF_1.fq.gz
sampleF_2.fq.gz
Notice that the original content from run1 is completely gone! The file now only contains the files from run2.

Important: The > operator overwrites the existing file content. All previous data is lost. This is why it's crucial to understand the difference between > and >>!
Quiz Answer: Task 3
Answer - Task 3:
If we start again from the beginning, but instead use the >> operator the second time we run the command, we will append the output to the file instead of replacing it:
ls run1/ > sequencing_files.txt
ls run2/ >> sequencing_files.txt
cat sequencing_files.txt
sampleA_1.fq.gz
sampleA_2.fq.gz
sampleB_1.fq.gz
sampleB_2.fq.gz
sampleC_1.fq.gz
sampleC_2.fq.gz
sampleD_1.fq.gz
sampleD_2.fq.gz
sampleE_1.fq.gz
sampleE_2.fq.gz
sampleF_1.fq.gz
sampleF_2.fq.gz
Perfect! Now we have all files from both directories in a single list. The >> operator preserved the original content and added the new content at the end.
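The overwrite-versus-append behaviour is easy to see with echo and a scratch file:

```shell
# > truncates the file before writing; >> keeps what is already there.
echo "first"  > log.txt
echo "second" > log.txt    # overwrites: "first" is gone
echo "third" >> log.txt    # appends after "second"
cat log.txt                # second, then third
```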
Quiz: Coronavirus Variants Challenge
Question 5:
In the directory coronavirus/variants/, there are several CSV files with information about SARS-CoV-2 virus samples that were classified according to clades (these are also commonly known as coronavirus variants).
01
Combine Files
Combine all files into a new file called all_countries.csv
Hint: Use wildcards and output redirection
02
Filter for Alpha
Create another file called alpha.csv that contains only the Alpha variant samples
Hint: Think about pattern searching
03
Count Alpha Samples
How many Alpha samples are there in total?
Hint: Use the appropriate counting command
Quiz Answer: Coronavirus Variants
Task 1: Combine All Files
We can use cat to combine all the files into a single file:
cat *_variants.csv > all_countries.csv
The wildcard *_variants.csv matches all CSV files ending with "_variants.csv", and the > operator saves the combined output to a new file.
Task 2: Filter for Alpha Variant
We can use grep to find a pattern in our text file and use > to save the output in a new file:
grep "Alpha" all_countries.csv > alpha.csv
We could investigate the output of our command using less alpha.csv.
Task 3: Count Alpha Samples
We can use wc to count the lines of the newly created file:
wc -l alpha.csv
Giving us 38 as the result.

This exercise demonstrates how to chain together multiple commands to perform data analysis tasks: combining files, filtering by pattern, and counting results. These are common operations in bioinformatics workflows!
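The whole combine-filter-count pipeline can be rehearsed with tiny, made-up CSV files (sample names and values invented here):

```shell
# Recreate the combine -> filter -> count workflow with toy data.
printf 'UK01,Alpha\nUK02,Delta\n'  > uk_variants.csv
printf 'DE01,Alpha\nDE02,Alpha\n'  > germany_variants.csv
cat *_variants.csv > all_countries.csv       # combine with a wildcard
grep "Alpha" all_countries.csv > alpha.csv   # keep Alpha samples only
wc -l < alpha.csv                            # 3
```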
Keep Learning!
Congratulations on completing this comprehensive guide to Unix and Linux! You've taken your first steps into a powerful world of command-line computing. Remember, mastery comes with practice and experimentation.
The command line is a tool that grows with you—the more you use it, the more efficient and creative you'll become. Don't be discouraged by challenges; every error message is a learning opportunity.