Clean up (database) tables

In my case, articles appeared twice or even umpteen times in the post overview and under the counters. The culprit was the Prosodia plugin, which had accumulated many duplicate rows in one of its tables. But how do you clean up these duplicates?

Manually? In a 12 MB SQL dump? Probably not. I’d be sitting on it for days! I also couldn’t find any tools that clean up MySQL dumps. Crap.

But it’s really quick with a Python script:

import os
import sys

# User prompt to enter the source file
input_file = input("\nThe input file will \033[91m\033[1mNOT\033[0m be modified! \nAn output file and a file with the deleted data will be created. If they already exist, they will be overwritten without confirmation!\nPlease provide the path to the input file: ").strip()

# Check whether the file exists
if not os.path.isfile(input_file):
    print(f "The file '{input_file}' does not exist. Please check the path.")
    exit(1)

# Automatic generation of the output files
output_file = f"{os.path.splitext(input_file)[0]}_cleaned{os.path.splitext(input_file)[1]}"
deleted_file = f"{os.path.splitext(input_file)[0]}_deleted{os.path.splitext(input_file)[1]}"

# Set for saving unique lines
seen_lines = set()

# Counter for the lines
written_lines = 0
deleted_lines = 0

# Determine the size of the input file in bytes
input_size = os.path.getsize(input_file)

# Auxiliary function for formatting the file sizes in US format
def format_size(size_in_bytes):
    if size_in_bytes < 1024:
        return f"{size_in_bytes:,.2f} bytes"
    elif size_in_bytes < 1048576:
        return f"{size_in_bytes / 1024:,.2f} KB"
    elif size_in_bytes < 1048576 * 1024:
        return f"{size_in_bytes / 1048576:,.2f} MB"
    else:
        return f"{size_in_bytes / 1073741824:,.2f} GB"

try:
    with open(input_file, mode="r", encoding="utf-8") as infile, \
         open(output_file, mode="w", encoding="utf-8") as outfile, \
         open(deleted_file, mode="w", encoding="utf-8") as dfile:

        for line in infile:
            # Keep blank lines in the output
            if line.strip() == "":
                outfile.write(line)  # Output blank line unchanged
                continue

            # Write line without duplicates to the output file
            if line not in seen_lines:
                outfile.write(line)
                seen_lines.add(line)  # Mark line as seen
                written_lines += 1
            else:
                deleted_lines += 1   # Count deleted duplicates
                dfile.write(line)    # Write deleted lines to the separate file

            # Optional: show the first 100 characters of the line being processed
            print(f"Processing line: {line[:100]}...")

    # Determine the size of the output files in bytes
    output_size = os.path.getsize(output_file)
    deleted_size = os.path.getsize(deleted_file)

    # Calculate how much data was removed
    size_difference = input_size - output_size

    # Calculate the percentage of deleted data
    deleted_percentage = (deleted_size / input_size) * 100 if input_size > 0 else 0

    # Output of the final information
    print(f"\n\n\033[0m*********** F I N I S H E D ***********")
    print(f"\nInput file: \033[91m\033[1m{input_file}\033[0m")
    print(f "Cleaned file: \033[91m\033[1m{output_file}\033[0m")
    print(f "Deleted content file: \033[91m\033[1m{deleted_file}\033[0m")
    print(f"\n\033[91m\033[1m{deleted_lines:,}\033[0m duplicate lines have been successfully removed.")
    print(f "Number of output lines: \033[91m\033[1m{written_lines:,}\033[0m")
    print(f"\nInput file size: \033[91m\033[1m{format_size(input_size)}\033[0m")
    print(f "Cleaned file size: \033[91m\033[1m{format_size(output_size)}\033[0m")
    print(f "Deleted data size: \033[91m\033[1m{format_size(deleted_size)}\033[0m")
    print(f "Deleted data percentage: \033[91m\033[1m{deleted_percentage:,.2f}%\033[0m")
    print(f"\nPlease check the output files!")
    print(f"\nGood luck! \n")

except Exception as e:
    print(f "Oops! An error occurred: \033[91m\033[1m{e}\033[0m")

Copy the code and paste it into a text editor of your choice. Save the script under a name ending in .py, make it executable as a program under Properties, and then start it.
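Alternatively, run it straight from a terminal. A session might look roughly like this – the file names are placeholders, not taken from my setup:

$ python3 clean_dump.py

The input file will NOT be modified!
An output file and a file with the deleted data will be created. If they already exist, they will be overwritten without confirmation!
Please provide the path to the input file: prosodia_export.csv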

Even with large dumps, this is so quick that you will wonder whether anything has run at all. :-)

Of course, this script also works with other files such as .txt or .csv. Remember to make a backup copy of your original file before using the script – better safe than sorry!
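If you want to double-check the result afterwards, standard shell tools are enough – again, the file names below are placeholders:

wc -l input.sql input_cleaned.sql input_deleted.sql
sort input_cleaned.sql | uniq -d    # prints nothing if no duplicate lines remain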

Background

How do I change an SQL dump without a locally installed MySQL and phpMyAdmin? Difficult! So I had Prosodia export the counter (Zählmarken) tables as .csv. We are talking about tables that also store the complete text of every article. "Unwieldy" doesn't even begin to describe it. LibreOffice Calc is a no-go here too:

[Screenshot from 2024-11-24 14:01:54]

In the Python script I then tried CSV table functions like csv.field_size_limit, unique_rows, header, csv.reader, delimiter, quotechar, skipinitialspace and all that kind of stuff.
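To give you an idea, here is a rough reconstruction of what such a CSV attempt tends to look like – this is not my original code, and the file name is made up:

import csv
import sys

# Article bodies blow past the default field size limit, so raise it first
csv.field_size_limit(sys.maxsize)

with open("prosodia_export.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"', skipinitialspace=True)
    header = next(reader)                         # keep the header row separate
    unique_rows = {tuple(row) for row in reader}  # dedupe the parsed rows

# Parsing may succeed, but row order is lost, and writing everything back
# out with exactly the original quoting is yet another can of worms.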

This led to total chaos, and with every adjustment it got worse and worse! Another solution had to be found – a simpler one!

What do I normally edit scripts or dumps with? That's right: gedit! And if they are bigger? Kate! Both are text editors. And an SQL dump is ultimately just text!

In my case, the simplest method – working with the file as a text file without having to deal with table functions or more complex data structures – was the best solution.

If I treat the file as a plain text file, I can compare the lines one-to-one and delete duplicate lines without having to worry about CSV parsing, quoting or field separators. Keep it simple!
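Boiled down, the whole trick fits in a few lines – a minimal sketch of the principle, with made-up file names:

seen = set()
with open("dump.sql", encoding="utf-8") as src, \
     open("dump_cleaned.sql", "w", encoding="utf-8") as dst:
    for line in src:
        if line not in seen:   # first occurrence: keep it
            dst.write(line)
            seen.add(line)
        # any later identical line is simply skipped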

