Wordlist creator pulling words from many documents on my computer using mostly sed.

Posting this in Kali since this is mainly going to be used as a Kali-type tool.

Code purpose?
1. A teaching tool for the schools, nursing homes, churches, libraries and other places where I volunteer: to show, in a visual way people can see and learn from, one method used to generate info for cracking passwords, so they better understand how such people think.
2. To give teachers an easier way to generate word-lists in their classroom based on their own individual students' cognitive and learning abilities, rather than relying on completely generic, one-size-fits-all word-lists handed down based on where students should be, as opposed to where they really are in the teacher's own judgement.
3. My own growth in my field of employment as well.

Here is what I have written:

Code:
cat *.txt |
    perl -nle 'print if m{^[[:ascii:]]+$}' |
    sed '/^@/d' |
    sed '/^http/d' |
    sed 's/[0-9]*//g' |
    sed 's/ /\n/g' |
    sed 's/\r/\n/g' |
    sed 's/^[[:punct:]]*//' |
    sed 's/[[:punct:]]*$//' |
    tr '[:upper:]' '[:lower:]' | sort | uniq > 1-1.txt

I'm not sure if you would call what I have written a program, an app or a script. Anyway, using what I have learned here on this forum, plus some searching, I came up with a way, using mostly sed, to search through files on my computer and generate a wordlist that meets my specific needs.



I do not know how much detail I need to go into here, because I am sure you can look at the included attachment and see what I am trying to do. I will give a quick overview for anyone who visits and, like we all once were (and I still am), doesn't have a clue.



I wanted it to be able to search one or more documents as input and create an output document with every word on its own line, using each space as a marker to ensure each word lands on a separate line, in alphabetical order, ignoring all punctuation and numbers. It makes everything lower-case and deletes any blank lines too. The output file uses ASCII encoding so that special charsets and non-English chars are ignored. It should delete any line with the @ symbol or http, to prevent email addresses and websites from being included in the output file. Finally, it sorts the words alphabetically and removes any duplicate words.



I am having issues with stray punctuation chars still showing up, such as <, #, %, / and more. I could deal with these in the same manner that I dealt with @ and http, but I was thinking there could be a better method of doing this.
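I did wonder if stripping punctuation everywhere, instead of only at the start and end of lines, might be that better method. Something like this is my untested guess:

Code:
sed 's/[[:punct:]]//g'

Though I suppose that would also eat the hyphens I want to keep, so maybe it needs to be more selective.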



Other issues I am having that I have not been able to deal with are:

  • Although it does deal with blank lines, line number one of the output is always blank.
  • Most of the words are formatted as I want, but some words are indented 3 spaces, while others are indented as much as 15.
  • I get a black background box with 2 to 4 letters in it, and I have no idea why.
  • I like the idea of "-" being used like this, "aluminum-extruded", but hyphens also show up in ways I don't want.
  • I am getting things like "amanue!oglvee!norm" and I do not understand why. A typo I could understand, but I searched the folder and that string did not come up as a result.
  • I understand why things like "amateur/commercial/educational" happen, but I am not sure if or how I can break them up, using the "/" as a marker for a new line the way I use spaces as a marker for a new line.
  • I also get things like "ambitinternational", but again, searching the documents does not return it as a result.
  • I get things like "asharp-inchscreen,powerfulintelatomprocessorandanmpshooterwitharangeofmodes" and it blows my mind how to stop this.
There may be some other things I have overlooked.

In the end, the output should be a nice sorted one-word-per-line file.



I would like you all to pick it apart and let me know if I could have done any part of this differently, or whatever. Like I said in my introduction post, I am here to learn, so completely destroying what I have done is fine with me. I need the knowledge, so if you see a better way I would like to know.



Another thing I plan on adding is the capability to compare the output file with another list and not print out words I already have on that list.
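My untested guess for that comparison looks something like this, where known.txt is just a placeholder name for the list I already have:

Code:
grep -vxFf known.txt 1-1.txt > new-words.txt

From what I have read, -f takes the patterns from known.txt, -F treats them as fixed strings, -x requires whole-line matches, and -v keeps only the lines of 1-1.txt that do not match.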



I will throw in a question. Can this be made into a program so I can just click it and have it do everything with a click, rather than having to run it from the command-line? I considered using ./ but decided to wait until I got feedback before doing it. I fully intend for everything to be in a single folder.
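My rough idea, untested, was to save the whole pipeline as a script in that folder, something like wordlist.sh (just a placeholder name):

Code:
#!/bin/bash
# Work in the folder the script lives in, since everything is kept together.
cd "$(dirname "$0")" || exit 1
cat *.txt |
    perl -nle 'print if m{^[[:ascii:]]+$}' |
    sed '/^@/d' |
    sed '/^http/d' |
    sed 's/[0-9]*//g' |
    sed 's/ /\n/g' |
    sed 's/\r/\n/g' |
    sed 's/^[[:punct:]]*//' |
    sed 's/[[:punct:]]*$//' |
    tr '[:upper:]' '[:lower:]' | sort | uniq > 1-1.txt

Then mark it executable with chmod +x wordlist.sh; as I understand it, most file managers can run an executable script on a double-click, depending on their settings.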



I understand this is a lot to ask, so if it takes time to get an answer, I understand. I very much wanted to find ways to deal with these issues on my own, but after 5 days I thought maybe I needed to ask for help. I think I could deal with some of these issues by repeating some sed commands over and over, but I didn't know if that would be good form.



Thanks for any and all help anyone can supply. Overall I think I did pretty well but came up short. Instead of breaking this down into several posts, my thinking is that if anyone else comes through, having it all in one post would be more beneficial to them as well.



Thanks!
 

Attachments

  • 1-1.zip (772.7 KB)


I'll try to take a more detailed look at this later.
One thing that instantly pops out at me is that you could do all of your sed operations using a single sed invocation, rather than using multiple instances of sed with a pipe.

All you do is chain all of your operations together, separating them with semicolons (;), and sed will perform each of them in a single run.

So your code would look something like this:
Bash:
cat *.txt |
perl -nle 'print if m{^[[:ascii:]]+$}' |
sed '/^@/d;/^http/d;s/[0-9]*//g;s/ /\n/g;s/\r/\n/g;s/^[[:punct:]]*//;s/[[:punct:]]*$//' |
tr '[:upper:]' '[:lower:]' | sort | uniq > 1-1.txt

I admit - it's not very readable, but it will be much more efficient.
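If readability is a concern, you can also split the combined command across multiple -e options - as far as I know, that behaves identically to the semicolon-chained version:

Bash:
cat *.txt |
perl -nle 'print if m{^[[:ascii:]]+$}' |
sed -e '/^@/d' \
    -e '/^http/d' \
    -e 's/[0-9]*//g' \
    -e 's/ /\n/g' \
    -e 's/\r/\n/g' \
    -e 's/^[[:punct:]]*//' \
    -e 's/[[:punct:]]*$//' |
tr '[:upper:]' '[:lower:]' | sort | uniq > 1-1.txt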
 
I knew there was a way, just wasn't sure how best to do it. This will give me a way to break your code apart and see how it is supposed to be done. Thanks!
 
OK I've made my own attempt at this. Literally just using sed and sort.
See what you think:
Bash:
sed 's/[ \t\r[:punct:][:digit:][:cntrl:]]/\n/g;' *.txt | sed '/^$/d' | sort -u > final-wordlist.txt
Note - in the above, you can pass it a single .txt file, but as you wanted to get input from several files, I've used *.txt as per your original post. We don't need to use cat as an input to sed, you can just pass sed the filenames to act upon!

The first call to sed:
- converts all spaces, tabs, carriage returns, punctuation characters, digits and control characters to newline characters.

My rationale for this:
Using sed to convert all of the unwanted characters to newlines will effectively split all of the input into lines that are either:
- completely blank
Or
- that contain single words (or in some cases a single random sequence of characters).
Any hyphenated words will be split over two lines.
This should improve the overall quality of the words found and reduce the amount of noise from words made of random character sequences.

The output from that is piped to sed for a second pass, which removes ALL blank lines from the stream of text.
Rationale:
- This leaves us with lines containing single words (or character sequences).

Then we pipe everything that remains to sort and redirect the output to the output file.

Note: we're using sort's -u (unique) option. That way we don't need to pipe to uniq before redirecting to the output file. We can redirect the output file immediately after the sort instead.

So that should work for extracting words from a single file, or a number of files!
 
I am sure this is a problem with file encoding, where it displays A's. I get that at different points, but it may be d's or q's. I do not know for sure whether the problem is within the source files; I just pulled a bunch of files together to do this. I did perform a search on all the files and nothing showed up.

Which I am sure is possible, but can I control the number of letters in a word? Such as if I wanted to only print out words over 4 letters long? Also, sed, awk, grep and so on: do they only read text files, or can they also read Word and PDF documents and other types?

Dude, I don't know what kind of work you do; I hope it is computer related, because you are really good, as are a lot of people on this forum. You come across as very patient. You may be rolling your eyes and thinking (O#*^!) it's him again, but if you are, it never comes across. I am always thankful for any help I get.
 

Attachments

  • 1.jpg (22.6 KB)
  • 2.jpg (20.3 KB)
No, it's not an encoding thing. It looks to me like those characters are just UTF-8 characters that aren't available in the font you're using in SciTE. It looks as if the character codes are included in the images.
But I’m on my phone atm, so can’t zoom into the pictures to make out the digraph codes for them.
Using a different font should make those characters visible. But exactly which font, I have no idea!!

And yes, I am a professional programmer. Currently working as a C/C++ analyst/programmer.

I’ve been a programming nerd since I was 8 or 9 years old.
I enjoy learning new programming languages and I love a good programming puzzle. So I like helping people out with their programming problems on forums like this. Sometimes things are easy, other times they require a bit of effort and I learn something too!
 
Wanted to let you know: as I have said, I volunteer a lot of time to help others do things on their PCs and stay safe. Something else I do is generate audio books for churches, nursing homes and some other places. We have several schools for the blind in the area, and I have done a ton of work for them over the years. Every week I get a list of books, or if they have a book they send me a copy and I make them an audio version. I have always wanted to learn sed, awk, grep and so on. One day I was working on some of these audio books and got to thinking that sed would be a large help with a lot of what I have to do with them.

When making these books I have to find things and type a caption, and things like that, to add just a tad more understanding for them so they can stay with the story. I would say that 80% of the time I am typing in the same thing.

Well, with your help I have been able to generate more books for them. I normally can do 20-30 a week, and I get maybe 70-80 requests a week on average. Last week I was able to generate 62 audio books in total thanks to your help. After I go in and do my part, I run it through this program to generate the voice reading the book. The whole process can take 1-2 hours per book. I love doing this to help them out.

With your help I am more productive for them. I see more of them smile when they know I have their book. So I want to thank you for allowing me to see more smiles and get a few extra hugs. They range in age from around 10 to about 30, and on their behalf I want to give you another thank you. Getting 62 books out last week was an amazing effort, even to me. So thank you, and thank you again.

On the font thing I will see what I can do. If you come up with anything you think would help please let me know.
 
Maybe you should set up a GoFundMe and expand what you are doing into some sort of NGO? Third-world schools for the deaf and blind, e.g. in Ghana, are very limited. Access to such talking books would be gold to blind kids in the third world; the issue would also be the means to play such material, of course?
 
As for it being gold to them: yes, I receive emails weekly from many who want to thank me. Some have even sent me cash in the post. I have never charged a dime for anything.

These schools take people in from all over the world and teach them the basics of living blind. Most were born blind, with about 40% who had some sort of incident that resulted in blindness. Lots of soldiers are there, from all around the world. This week I will finish books for soldiers from Denmark, England, the U.S., Auckland and Russia. I can generate any language.

For me, when I pass beyond this world I want to know I have done something that impacted someone's life in a positive manner. You are helping me do this. I will take your suggestion and run it through my head. I enjoy the fact that I am not getting paid, but I understand that if there was some kind of funding I could do more for more people.

Alright, enough. We can close this out. If you are sure it is a font issue then I will take a look at my fonts and see what I can do. Thanks for your help. Should be posting again soon. Thanks!
 
Realised I forgot to reply to these:
can I control the amount of letters in a word. Such if I wanted to only print out words over 4 letters long?
Yes. By using regular expressions with sed, awk, tr, grep etc - you have total control over what you extract from a file, or stream containing text.

I'm not exactly a wizard with regular expressions myself. I have a very basic "vocabulary" with it, which serves me for things that I do regularly. But whenever I'm trying to do something new, or something I haven't done for a while - I usually have to do a bit of research and trial and error first.
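For example, to keep only words over 4 letters long, you could slot a length filter into the pipeline I posted earlier. Untested, but the regex simply means "a line of 5 or more characters":

Bash:
sed 's/[ \t\r[:punct:][:digit:][:cntrl:]]/\n/g;' *.txt | sed '/^$/d' | grep -E '^.{5,}$' | sort -u > final-wordlist.txt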

Also, sed,awk,grep and so on. Do they only read text files or can they also read Word and pdf documents and other types?
The various GNU text manipulation tools work best on pure text-files. That is what they were designed for.

So any files that use a pure text format - .txt, config files, shell-scripts, source-code etc will work really well with these tools.

The problem with Word documents and .pdf files is that they are binary files. Much of the text is encoded, or even encrypted, inside the document. So if you run sed and awk on them, you won't find many useful strings at all.

And it's a similar story with other binary files, like executables.

There is a command called strings, which is part of the binutils package - which can be handy for searching for strings in executable binaries.

For example:
If you use strings -d /path/to/executable to search the data section of an executable - You will be able to find any strings that are hard-coded into the application - potentially including passwords!!
And if you add the -tx option, it will show you the hexadecimal byte-offset in the file where the string starts. If you use the -a option, it will search the entire file for strings.

It can also be used for searching other types of binary files - but again - it's the same problem as grep, sed et al. Most text in binary documents is encoded in some way - so strings will not be able to extract text from a word, or pdf file.
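To make that concrete, a typical hunt through a binary might look something like this - the grep filter is just an illustration of narrowing down the output:

Bash:
# -a: scan the whole file; -tx: show each string's hex offset
strings -a -tx /path/to/executable | grep -i pass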

There is pdftotext - part of the poppler-utils package - which can convert a pdf file to text. So you could potentially extract text from pdf files that way.

From quickly looking at it online, you can run it like this:
Bash:
pdftotext /path/to/pdffile.pdf
And that will open /path/to/pdffile.pdf and will automatically create a text file at /path/to/pdffile.txt.
OR
You can use:
Bash:
pdftotext /path/to/pdffile.pdf /path/to/output.txt
That opens the pdf-file and it will create a text file at /path/to/output.txt.

For your wordlist creation - it looks like you should be able to do something like this, without creating a text file:
NOTE: Everything from here onwards is conjecture - I haven't actually tested any of this!
Bash:
pdftotext /path/to/pdffile.pdf - | sed 's/[ \t\r[:punct:][:digit:][:cntrl:]]/\n/g;' | sed '/^$/d' | sort -u > /path/to/wordlist.txt

By specifying - as the output file above, that should convert the pdf file to a text stream in stdout without creating a file. Meaning you can pipe the text stream to another process.
In this case, I've piped it to the sed commands we used earlier.

That should tokenise the stream to either be blank lines, or lines containing a single word, then we can strip the blank lines and then sort the remaining output, before writing the wordlist.

And I don't know if pdftotext supports multiple input files - if it does - you should be able to use
Code:
/path/to/*.pdf
as the input file:
Bash:
pdftotext /path/to/*.pdf - | sed 's/[ \t\r[:punct:][:digit:][:cntrl:]]/\n/g;' | sed '/^$/d' | sort -u > /path/to/wordlist.txt

If it does - great - if not, we'd have to set up a for loop around the previous example.
Which I think would end up looking like this:
Bash:
for pdf in *.pdf ; do
    # Quote the variable so filenames containing spaces don't break the loop.
    pdftotext "${pdf}" - | sed 's/[ \t\r[:punct:][:digit:][:cntrl:]]/\n/g;' | sed '/^$/d' | sort -u >> /path/to/wordlists.txt
done
sort -u /path/to/wordlists.txt > /path/to/final-wordlist.txt
rm /path/to/wordlists.txt

In the above, the content of each pdf file is converted to a stream of text, tokenised and sorted uniquely and then appended to a file called wordlists.txt.

After all of the pdf's have been run through pdftotext, we'll have a massive file (wordlists.txt) containing the sorted strings from each of the pdf's we just converted to text.

So now we need to run a final unique sort on the content of wordlists.txt - in order to merge everything together and create a final list of unique words from all of the files (final-wordlist.txt).

After generating final-wordlist.txt, we can delete wordlists.txt because it's no longer needed - it will just be using up valuable space!

And again - like I said - this was all conjecture. I haven't actually tried it. That was just what I could infer from what I've read online about the pdftotext command.

As long as the .pdfs you convert are DRM free - I'd imagine that pdftotext should work really well. I don't know if it will work at all with .pdf files that are protected by DRM.

There is also a pdfgrep command that will allow you to search for strings inside pdf files.

For M$ Word documents, there are a few tools, like docx2txt (I think the package is also called docx2txt). Again, I've never used it, but I'm aware of it. I imagine it will work similarly to pdftotext.

And there is a LibreOffice tool called unoconv that can convert any file-types compatible with LibreOffice into html files, but I think it has the option to output as plain text too. So that might be an option. AFAICT, that should work with any M$ .doc files that will open in LibreOffice!
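From a quick look at the docs, the unoconv invocation should be something along these lines - again, untested conjecture on my part:

Bash:
# -f selects the output format; this should create document.txt
# next to the original (assuming plain-text output is supported).
unoconv -f txt document.doc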

I'm not aware of any tools for grepping through .doc files though. There may be some. I try to completely avoid MS office - even at work! I'd much rather use Libreoffice! Ha ha!
 
I use Acrobat Pro and it converts about anything to PDF, and then I can use its option to convert that to text. It even works with LibreOffice. I like LibreOffice and use it often. I like open source, so I try to support what I like as much as I can.

My library currently has almost 400,000 book titles, and I love white papers, so I have over 150,000 of those. I have a spider that crawls websites 24/7 pulling white papers and other data, then sending them back to a server system I have. In my shop I run a Xeon E3-1246v3 server with 128GB of Nemix RAM (8x16GB DDR3-1333 ECC) into 2 2TB hard drives, with a searchable cabinet holding 16TB of hard data. It's completely insane what that spider picks up. It has the capability to brute force, but I have never used that function; it crawls open sites and FTP and anything that doesn't require passwords. I just thought about you getting nervous because I have been working on a way to grab data and generate word-lists, lol. Put together like that, it doesn't sound good, but I assure you that's not what this is for. Anyway, with Acrobat I can convert anything over to text. No problem at all.

I want to let you know that I have run this latest one alongside the first one you helped me put together. The first one works much better. If I can figure out the minimum-letter part, I think I will be nearing "as perfect" territory, because I think that will eliminate the lines with stuff like "og" and "taop", the stuff that makes no sense. I can remove blank lines in EditPad. This project has been fun to work on and has helped me gain a basic understanding of sed and awk. If I keep working on it and adding more functions, I will learn even more.

If you use strings -d /path/to/executable to search the data section of an executable - You will be able to find any strings that are hard-coded into the application - potentially including passwords!!
Passwords inside an .exe file? I would like you to explain that, because I was unaware that .exe files could have passwords encoded somewhere like that. Or should I start another post for that?

Dude, if you lived close by, I would wear out my welcome in one weekend. lol

OK, thanks. But really, the passwords in an exe file do intrigue me. Not that I would do anything with them, but knowing more about it would be great. OK, thanks. Enjoy your weekend!

 
