I have unix sockets sitting in my source dir:
$ ls -al /home/vagrant/pdns/pdns/pdns.controlsocket
srwxr-xr-x 1 root root 0 Jul 23 14:58 /home/vagrant/pdns/pdns/pdns.controlsocket
This makes codespell abort:
$ ./codespell.py -s ~/pdns
[...]
Traceback (most recent call last):
  File "./codespell.py", line 527, in <module>
    sys.exit(main(*sys.argv))
  File "./codespell.py", line 516, in main
    parse_file(os.path.join(root, file), colors, summary)
  File "./codespell.py", line 378, in parse_file
    if not istextfile(filename):
  File "./codespell.py", line 298, in istextfile
    with open(filename, mode='rb') as f:
IOError: [Errno 6] No such device or address: '/home/vagrant/pdns/pdns/pdns.controlsocket'
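One possible way to avoid the crash is to skip anything that is not a regular file before trying to open it. The following is only a sketch: is_regular_file is a hypothetical helper name, not something that exists in codespell.

```python
import os
import stat


def is_regular_file(path):
    """Return True only for regular files (hypothetical helper)."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Broken symlink, permission problem, or vanished file: skip it.
        return False
    # Sockets, FIFOs, devices and directories all fail this check.
    return stat.S_ISREG(mode)
```

A caller walking the tree could then test each path with is_regular_file() before handing it to parse_file().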
Most often users just need the default dictionary, so let's create an option
with the default value set to dictionary.txt. Users can override this parameter
with their own dictionary.
By default codespell will check the $PWD/data/dictionary.txt file. This is where
the dictionary is stored if one uses codespell from a source tarball or git repo.
It is expected that package managers will install the dictionary to a shared
data folder and change the default value accordingly.
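As an illustration of such an option, here is a minimal sketch using argparse. The option names (-d/--dictionary), the DEFAULT_DICTIONARY constant and build_parser are assumptions made for this example and may not match what codespell actually uses.

```python
import argparse
import os

# Assumed default: the dictionary shipped next to the sources.
# A packager would patch this to point at the shared data folder.
DEFAULT_DICTIONARY = os.path.join(os.getcwd(), 'data', 'dictionary.txt')


def build_parser():
    parser = argparse.ArgumentParser(prog='codespell')
    # Hypothetical option name, used here only for illustration.
    parser.add_argument('-d', '--dictionary', default=DEFAULT_DICTIONARY,
                        help='dictionary file to use (default: %(default)s)')
    parser.add_argument('paths', nargs='*',
                        help='files or directories to check')
    return parser
```

With no option given, args.dictionary falls back to the default path; passing --dictionary mine.txt overrides it.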
When opening the dict, we know it's in UTF-8, so do not rely on the
environment.
Thanks to Ettl Martin for reporting the problem and finding the
solution.
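A sketch of reading the dictionary with an explicit encoding, so that the locale settings of the environment do not matter. load_dictionary is a hypothetical name, and the "wrong->right" line format is assumed from the rules quoted elsewhere in these messages.

```python
import io


def load_dictionary(path):
    """Read 'wrong->right' lines into a dict, always decoding as UTF-8."""
    misspellings = {}
    # The explicit encoding overrides whatever the locale would pick.
    with io.open(path, mode='r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or '->' not in line:
                continue
            wrong, right = line.split('->', 1)
            misspellings[wrong] = right
    return misspellings
```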
The tradeoff is that it's much, much, much slower. In my tests, circa 10
times slower than without chardet. But it always uses the right encoding.
Maybe the right thing to do is to only fall back to chardet, since most
source code is in ascii/utf-8/iso8859-1. This will be left undecided
until 1.2 comes out.
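The fallback idea could look roughly like the sketch below: try UTF-8 first and only ask chardet when that fails. This is an illustration of the undecided alternative, not what codespell does; decode_bytes is a hypothetical name and chardet is treated as an optional third-party dependency.

```python
def decode_bytes(data):
    """Decode bytes, trying UTF-8 first and guessing only on failure."""
    try:
        # Fast path: covers both plain ASCII and UTF-8 sources.
        return data.decode('utf-8')
    except UnicodeDecodeError:
        pass
    try:
        import chardet  # third-party, optional in this sketch
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(data).get('encoding')
        if guess:
            try:
                return data.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    # ISO-8859-1 maps every byte value, so this last resort never fails.
    return data.decode('iso-8859-1')
```

The slow detection step is then paid only for files that are not valid UTF-8.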
When one line had the same misspelled word more than once, codespell was
incorrectly fixing that line, even introducing new typos. This was because
the list of misspelled words was not updated according to the fixes.
Instead of always updating this list and making the loop more difficult,
we do the following:
- Cache the words that are fixed in a certain line
- Fix all cases of a misspelled word in each line (this means that
interactive mode will fix all cases with the same suggestion... not
awesome, but it simplifies the code a lot)
- Use a regex with re.sub() instead of the naive string.replace()
function. This eliminates dumb cases of matching partial words and
modifying them. E.g.: addres->address would modify addressable to
addresssable.
- Skip words that were already fixed by a previous iteration.
Thanks to Bruce Cran <bruce@cran.org.uk> for reporting this issue.
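The difference between the two replacement strategies listed above can be sketched as follows. fix_word is a hypothetical helper, and anchoring on \b is one way to match whole words only; the exact pattern used by codespell may differ.

```python
import re


def fix_word(line, wrong, right):
    """Replace every whole-word occurrence of `wrong` in `line`."""
    # \b anchors the match at word boundaries, so 'addres' inside
    # 'addressable' is left alone.
    pattern = r'\b' + re.escape(wrong) + r'\b'
    return re.sub(pattern, right, line)


line = "addres addressable"
naive = line.replace("addres", "address")    # also hits the partial word
fixed = fix_word(line, "addres", "address")  # whole words only
```

Here naive ends up as "address addresssable", while fixed is "address addressable".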
Sets already use the __hash__() method of each object to decide if an
object is in it. When we use the sha1 we are therefore hashing twice.
The impact is on performance. Below are the timings before and after
this patch when parsing the entire Linux kernel tree with a big exclude
list.
Before:
real 2m20.959s
user 2m16.888s
sys 0m1.386s
After:
real 1m35.169s
user 1m28.719s
sys 0m1.354s
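A rough sketch of the two approaches, assuming the exclude list is a set of lines. The function names are made up for this example and the real code is organised differently.

```python
import hashlib


def build_exclude_set_sha1(lines):
    # Old approach: hash every line with sha1, then let the set hash
    # the resulting digest string again.
    return set(hashlib.sha1(l.encode('utf-8')).hexdigest() for l in lines)


def build_exclude_set(lines):
    # New approach: store the lines themselves; the set already calls
    # __hash__() on each one.
    return set(lines)


def is_excluded(line, exclude_set):
    # Plain membership test, no extra digest per lookup.
    return line in exclude_set
```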
In case the word to be fixed has apostrophes, codespell was not making
the right fix. E.g.:
i) "doesn't" was read as two separate words: "doesn" and "t"
ii) "doesnt'" was read as "doesnt"
iii) "doens't" was read as two separate words: "doens" and "t"
(i) is not a big deal since the spelling is right. In (ii) the fix would
be obviously wrong: the doesnt->doesn't rule would apply and the net
result would be "doesn't'". (iii) is even worse since the doens->does
rule would apply and the result would be "does't".
By adding the apostrophe to the list of chars treated as word boundary,
(i) and (iii) are fixed; new rules are added to the dictionary in order
to fix (ii).
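The tokenization difference can be illustrated with two word patterns, one without and one with the apostrophe in the character class. Both patterns are assumptions made for this sketch; the actual regex in codespell may differ.

```python
import re

# Assumed pattern before the change: an apostrophe ends the word.
word_without_apostrophe = re.compile(r"[\w\-]+")
# Assumed pattern after the change: the apostrophe is part of the word.
word_with_apostrophe = re.compile(r"[\w\-']+")

before = word_without_apostrophe.findall("doens't")
after = word_with_apostrophe.findall("doens't")
```

With the first pattern "doens't" splits into "doens" and "t"; with the second it stays a single token that a dictionary rule can match as a whole.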
If there are two misspelled words in a line, codespell was detecting
both, but when writing them to the file only the last one was actually
being fixed. This is because we used the wrong line string to fix them.
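The kind of bug described here, and its fix, can be sketched as below: each substitution has to be applied to the already-updated string rather than to the original line. fix_line and its fixes argument are hypothetical and only illustrate the idea.

```python
import re


def fix_line(line, fixes):
    """Apply every wrong->right pair in `fixes` to `line`."""
    fixed = line
    for wrong, right in fixes.items():
        # Substitute into `fixed`, not `line`; using `line` here would
        # throw away all earlier fixes and keep only the last one.
        fixed = re.sub(r'\b' + re.escape(wrong) + r'\b', right, fixed)
    return fixed
```

For example, fix_line("teh addres is here", {"teh": "the", "addres": "address"}) corrects both words in one pass over the fixes.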